{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#选择合适的index类型\n",
    "选择index类型并没有一套精准的法则可以依据，需要根据自己的实际情况选取。下面的几个问题可以作为选取index的参考。\n",
    "\n",
    "##是否需要精确的结果\n",
    "如果需要，应该使用“Flat”\n",
    "只有 IndexFlatL2 能确保返回精确结果。一般将其作为baseline与其他索引方式对比，以便在精度和时间开销之间做权衡。    \n",
    "不支持add_with_ids，如果需要，可以用“IDMap, Flat”。  \n",
    "支持GPU。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "#导入faiss\n",
    "import sys\n",
    "sys.path.append('/home/maliqi/faiss/python/')\n",
    "import faiss\n",
    "\n",
    "#数据\n",
    "import numpy as np \n",
    "d = 512          #维数\n",
    "n_data = 2000   \n",
    "np.random.seed(0) \n",
    "data = []\n",
    "mu = 3\n",
    "sigma = 0.1\n",
    "for i in range(n_data):\n",
    "    data.append(np.random.normal(mu, sigma, d))\n",
    "data = np.array(data).astype('float32')\n",
    "\n",
    "#ids, 6位随机数\n",
    "ids = []\n",
    "start = 100000\n",
    "for i in range(data.shape[0]):\n",
    "    ids.append(start)\n",
    "    start += 100\n",
    "ids = np.array(ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[   0  798  879  223  981 1401 1458 1174  919   26]\n",
      " [   1  981 1524 1639 1949 1472 1162  923  840  300]\n",
      " [   2 1886  375 1351  518 1735 1551 1958  390 1695]\n",
      " [   3 1459  331  389  655 1943 1483 1723 1672 1859]\n",
      " [   4   13  715 1470  608  459  888  850 1080 1654]]\n"
     ]
    }
   ],
   "source": [
    "#不支持add_with_ids\n",
    "index = faiss.index_factory(d, \"Flat\")\n",
    "index.add(data)\n",
    "dis, ind = index.search(data[:5], 10)\n",
    "print(ind)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[100000 179800 187900 122300 198100 240100 245800 217400 191900 102600]\n",
      " [100100 198100 252400 263900 294900 247200 216200 192300 184000 130000]\n",
      " [100200 288600 137500 235100 151800 273500 255100 295800 139000 269500]\n",
      " [100300 245900 133100 138900 165500 294300 248300 272300 267200 285900]\n",
      " [100400 101300 171500 247000 160800 145900 188800 185000 208000 265400]]\n"
     ]
    }
   ],
   "source": [
    "index = faiss.index_factory(d, \"IDMap, Flat\")\n",
    "index.add_with_ids(data, ids)\n",
    "dis, ind = index.search(data[:5], 10)\n",
    "print(ind)   # 返回的结果是我们自己定义的id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##关心内存开销\n",
    "需要注意的是faiss在索引时必须将index读入内存。\n",
    "###如果不在意内存占用空间，使用“HNSWx”\n",
    "如果内存空间很大，数据库很小，HNSW是最好的选择，速度快，精度高，一般4<=x<=64。不支持add_with_ids，不支持移除向量，不需要训练，不支持GPU。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 879  981   26 1132  807 1639   28 1334 1832 1821]\n",
      " [   1  981 1524 1639 1949 1472 1162  923  840  300]\n",
      " [   2 1886  375 1351  518 1958  390 1695 1707 1080]\n",
      " [   3 1459  331  389  655 1483 1723 1672 1859  650]\n",
      " [   4   13  715 1470  608  459 1080 1654  665  154]]\n"
     ]
    }
   ],
   "source": [
    "index = faiss.index_factory(d, \"HNSW8\")\n",
    "index.add(data)\n",
    "dis, ind = index.search(data[:5], 10)\n",
    "print(ind) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###如果稍微有点在意，使用“..., Flat“\n",
    "\"...\"是聚类操作，聚类之后将每个向量映射到相应的bucket。该索引类型并不会保存压缩之后的数据，而是保存原始数据，所以内存开销与原始数据一致。通过nprobe参数控制速度/精度。  \n",
    "支持GPU,但是要注意，选用的聚类操作必须也支持。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[   0  879  981 1401  919  143    2  807 1515 1393]\n",
      " [   1  511 1504  987  747  422 1911  638  851 1198]\n",
      " [   2  879  807  981 1401 1143  733  441 1324 1280]\n",
      " [   3  740  155 1337 1578 1181 1743  290  588 1340]\n",
      " [   4 1176  256 1186  574 1459  218  480 1828  942]]\n"
     ]
    }
   ],
   "source": [
    "index = faiss.index_factory(d, \"IVF100, Flat\")\n",
    "index.train(data)\n",
    "index.add(data)\n",
    "dis, ind = index.search(data[:5], 10)\n",
    "print(ind)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###如果很在意，使用”PCARx,...,SQ8“\n",
    "如果保存全部原始数据的开销太大，可以用这个索引方式。包含三个部分，  \n",
    "1.降维  \n",
    "2.聚类  \n",
    "3.scalar 量化，每个向量编码为8bit\n",
    "不支持GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[   0  671  196 1025  624 1521  724  879 1281  533]\n",
      " [   1 1008  698  206  101  657  294  383  700  574]\n",
      " [   2 1594  754 1850  266  559  154 1723 1949 1910]\n",
      " [   3 1758  820  869 1067   14  211 1214   78 1445]\n",
      " [   4 1457  466  557 1604 1951  912  736 1974  836]]\n"
     ]
    }
   ],
   "source": [
    "index = faiss.index_factory(d, \"PCAR16,IVF50,SQ8\")  #每个向量降为16维\n",
    "index.train(data)\n",
    "index.add(data)\n",
    "dis, ind = index.search(data[:5], 10)\n",
    "print(ind)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###如果非常非常在意，使用\"OPQx_y,...,PQx\"\n",
    "y需要是x的倍数，一般保持y<=d，y<=4*x。\n",
    "支持GPU。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[   0 1686 1186 1552   47  517 1563 1738 1748  125]\n",
      " [   1  747 1816   41 1599  380 1179  803 1964  422]\n",
      " [   2 1610 1886  928  397  874  676  535 1401  929]\n",
      " [   3  548   89  509 1337  865 1472 1210 1181 1578]\n",
      " [   4  260 1781 1001 1179   41   20  747 1803 1055]]\n"
     ]
    }
   ],
   "source": [
    "index = faiss.index_factory(d, \"OPQ32_512,IVF50,PQ32\")  \n",
    "index.train(data)\n",
    "index.add(data)\n",
    "dis, ind = index.search(data[:5], 10)\n",
    "print(ind)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##数据集的大小\n",
    "在高效检索的index中，聚类是其中的基础操作，数据量的大小主要影响聚类过程。\n",
    "###如果小于1M， 使用\"...,IVFx,...\"\n",
    "N是数据集中向量个数，x一般取值[4*sqrt(N),16*sqrt(N)],需要30*x ～ 256*x个向量的数据集去训练。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###如果在1M-10M，使用\"...,IMI2x10,...\"\n",
    "使用k-means将训练集聚类为2^10个类，但是执行过程是在数据集的两半部分独立执行，即聚类中心有2^(2*10)个。\n",
    "###如果在10M-100M，使用\"...,IMI2x12,...\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
